32 research outputs found

    Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler

    BACKGROUND: Despite the short length of their reads, micro-read sequencing technologies have shown their usefulness for de novo sequencing. However, especially in eukaryotic genomes, complex repeat patterns are an obstacle to large assemblies. PRINCIPAL FINDINGS: We present a novel heuristic algorithm, Pebble, which uses paired-end read information to resolve repeats and scaffold contigs to produce large-scale assemblies. In simulations, we can achieve weighted median scaffold lengths (N50) of above 1 Mbp in Bacteria and above 100 kbp in more complex organisms. Using real datasets we obtained a 96 kbp N50 in Pseudomonas syringae and a unique 147 kbp scaffold of a ferret BAC clone. We also present an efficient algorithm called Rock Band for the resolution of repeats in the case of mixed length assemblies, where different sequencing platforms are combined to obtain a cost-effective assembly. CONCLUSIONS: These algorithms extend the utility of short read only assemblies into large complex genomes. They have been implemented and made available within the open-source Velvet short-read de novo assembler
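The N50 statistic quoted in this abstract can be illustrated with a short calculation. The sketch below is not taken from Velvet; it assumes the common definition of N50 as the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembled length. The function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (in kbp), for illustration only.
scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(scaffolds))
```

Because the statistic is weighted by length, a few long scaffolds dominate it, which is why it is often preferred over a plain median when comparing assemblies.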

    CONDOR: a database resource of developmentally associated conserved non-coding elements

    Background: Comparative genomics is currently one of the most popular approaches to study the regulatory architecture of vertebrate genomes. Fish-mammal genomic comparisons have proved powerful in identifying conserved non-coding elements likely to be distal cis-regulatory modules such as enhancers, silencers or insulators that control the expression of genes involved in the regulation of early development. The scientific community is showing increasing interest in characterizing the function, evolution and language of these sequences. Despite this, there remains little in the way of user-friendly access to a large dataset of such elements in conjunction with the analysis and the visualization tools needed to study them. Description: Here we present CONDOR (COnserved Non-coDing Orthologous Regions), available at http://condor.fugu.biology.qmul.ac.uk. In an interactive and intuitive way the website displays data on more than 6800 non-coding elements associated with over 120 early developmental genes and conserved across vertebrates. The database regularly incorporates results of ongoing in vivo zebrafish enhancer assays of the CNEs carried out in-house, which currently number ~100. Included and highlighted within this set are elements derived from duplication events both at the origin of vertebrates and more recently in the teleost lineage, thus providing valuable data for studying the divergence of regulatory roles between paralogs. CONDOR therefore provides a number of tools and facilities to allow scientists to progress in their own studies on the function and evolution of developmental cis-regulation. Conclusion: By providing access to data with an approachable graphics interface, the CONDOR database presents a rich resource for further studies into the regulation and evolution of genes involved in early development

    Fosmid-based whole genome haplotyping of a HapMap trio child: evaluation of Single Individual Haplotyping techniques

    Determining the underlying haplotypes of individual human genomes is an essential, but currently difficult, step toward a complete understanding of genome function. Fosmid pool-based next-generation sequencing allows genome-wide generation of 40-kb haploid DNA segments, which can be phased into contiguous molecular haplotypes computationally by Single Individual Haplotyping (SIH). Many SIH algorithms have been proposed, but the accuracy of such methods has been difficult to assess due to the lack of real benchmark data. To address this problem, we generated whole genome fosmid sequence data from a HapMap trio child, NA12878, for which reliable haplotypes have already been produced. We assembled haplotypes using eight algorithms for SIH and carried out direct comparisons of their accuracy, completeness and efficiency. Our comparisons indicate that fosmid-based haplotyping can deliver highly accurate results even at low coverage and that our SIH algorithm, ReFHap, is able to efficiently produce high-quality haplotypes. We expanded the haplotypes for NA12878 by combining the current haplotypes with our fosmid-based haplotypes, producing near-to-complete new gold-standard haplotypes containing almost 98% of heterozygous SNPs. This improvement includes notable fractions of disease-related and GWA SNPs. Integrated with other molecular biological data sets, this phase information will advance the emerging field of diploid genomics

    Retroviral integrations contribute to elevated host cancer rates during germline invasion

    © 2021, The Author(s). Repeated retroviral infections of vertebrate germlines have made endogenous retroviruses ubiquitous features of mammalian genomes. However, millions of years of evolution obscure many of the immediate repercussions of retroviral endogenisation on host health. Here we examine retroviral endogenisation during its earliest stages in the koala (Phascolarctos cinereus), a species undergoing germline invasion by koala retrovirus (KoRV) and affected by high cancer prevalence. We characterise KoRV integration sites (IS) in tumour and healthy tissues from 10 koalas, detecting 1002 unique IS, with hotspots of integration occurring in the vicinity of known cancer genes. We find that tumours accumulate novel IS, with proximate genes over-represented for cancer associations. We detect dysregulation of genes containing IS and identify a highly-expressed transduced oncogene. Our data provide insights into the tremendous mutational load suffered by the host during active retroviral germline invasion, a process repeatedly experienced and overcome during the evolution of vertebrate lineages

    Early Evolution of Conserved Regulatory Sequences Associated with Development in Vertebrates

    Comparisons between diverse vertebrate genomes have uncovered thousands of highly conserved non-coding sequences, an increasing number of which have been shown to function as enhancers during early development. Despite their extreme conservation over 500 million years from humans to cartilaginous fish, these elements appear to be largely absent in invertebrates, and, to date, there has been little understanding of their mode of action or the evolutionary processes that have modelled them. We have now exploited emerging genomic sequence data for the sea lamprey, Petromyzon marinus, to explore the depth of conservation of this type of element in the earliest diverging extant vertebrate lineage, the jawless fish (agnathans). We searched for conserved non-coding elements (CNEs) at 13 human gene loci and identified lamprey elements associated with all but two of these gene regions. Although markedly shorter and less well conserved than within jawed vertebrates, identified lamprey CNEs are able to drive specific patterns of expression in zebrafish embryos, which are almost identical to those driven by the equivalent human elements. These CNEs are therefore a unique and defining characteristic of all vertebrates. Furthermore, alignment of lamprey and other vertebrate CNEs should permit the identification of persistent sequence signatures that are responsible for common patterns of expression and contribute to the elucidation of the regulatory language in CNEs. Identifying the core regulatory code for development, common to all vertebrates, provides a foundation upon which regulatory networks can be constructed and might also illuminate how large conserved regulatory sequence blocks evolve and become fixed in genomic DNA

    Functional Analysis of Conserved Non-Coding Regions Around the Short Stature hox Gene (shox) in Whole Zebrafish Embryos

    Background: Mutations in the SHOX gene are responsible for Leri-Weill Dyschondrosteosis, a disorder characterised by mesomelic limb shortening. Recent investigations into regulatory elements surrounding SHOX have shown that deletions of conserved non-coding elements (CNEs) downstream of the SHOX gene produce a phenotype indistinguishable from Leri-Weill Dyschondrosteosis. As this gene is not found in rodents, we used zebrafish as a model to characterise the expression pattern of the shox gene across the whole embryo and characterise the enhancer domains of different CNEs associated with this gene. Methodology/Principal Findings: Expression of the shox gene in zebrafish was identified using in situ hybridization, with embryos showing expression in the blood, putative heart, hatching gland, brain, pharyngeal arch, olfactory epithelium, and fin bud apical ectodermal ridge. By identifying sequences showing 65% identity over at least 40 nucleotides between Fugu, human, dog and opossum we uncovered 35 CNEs around the shox gene. These CNEs were compared with CNEs previously discovered by Sabherwal et al., resulting in the identification of smaller, more deeply conserved sub-sequences. Sabherwal et al.’s CNEs were assayed for regulatory function in whole zebrafish embryos, resulting in the identification of additional tissues under the regulatory control of these CNEs. Conclusion/Significance: Our results using whole zebrafish embryos have provided a more comprehensive picture of the expression pattern of the shox gene, and a better understanding of its regulation via deeply conserved noncoding elements. In particular, we identify additional tissues under the regulatory control of previously identified SHOX CNEs. We also demonstrate the importance of these CNEs in evolution by identifying duplicated shox CNEs and more deeply conserved sub-sequences within already identified CNEs
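The conservation criterion mentioned in this abstract (65% identity over at least 40 nucleotides) can be sketched as a sliding-window scan. This is only an illustration under simplifying assumptions, not the authors' pipeline, which the abstract does not describe: it takes two sequences that are already aligned (equal length, `-` for gaps) rather than four species, uses a fixed window of exactly 40 columns, and counts gap columns as mismatches. The name `conserved_windows` and its parameters are hypothetical.

```python
def conserved_windows(aligned_a, aligned_b, window=40, min_identity=0.65):
    """Yield (start, identity) for each alignment window meeting the threshold.

    Assumes the two inputs are rows of a pairwise alignment of equal
    length, with '-' marking gaps. Gap columns never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both rows carry the same base, 0 otherwise.
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    ]
    if len(matches) < window:
        return
    count = sum(matches[:window])
    for start in range(len(matches) - window + 1):
        if start > 0:
            # Slide the window by one column: add the new one, drop the old one.
            count += matches[start + window - 1] - matches[start - 1]
        identity = count / window
        if identity >= min_identity:
            yield start, identity


# Hypothetical toy alignment: 30 matching columns followed by 10 gap columns.
seq_a = "ACGT" * 10
seq_b = seq_a[:30] + "-" * 10
print(list(conserved_windows(seq_a, seq_b)))
```

Overlapping windows that pass the threshold would normally be merged into a single candidate element; that step is left out here to keep the sketch short.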

    Ancient duplicated conserved noncoding elements in vertebrates: A genomic and functional analysis

    Fish–mammal genomic comparisons have proved powerful in identifying conserved noncoding elements likely to be cis-regulatory in nature, and the majority of those tested in vivo have been shown to act as tissue-specific enhancers associated with genes involved in transcriptional regulation of development. Although most of these elements share little sequence identity to each other, a small number are remarkably similar and appear to be the product of duplication events. Here, we searched for duplicated conserved noncoding elements in the human genome, using comparisons with Fugu to select putative cis-regulatory sequences. We identified 124 families of duplicated elements, each containing between two and five members, that are highly conserved within and between vertebrate genomes. In 74% of cases, we were able to assign a specific set of paralogous genes with annotation relating to transcriptional regulation and/or development to each family, thus removing much of the ambiguity in identifying associated genes. We find that duplicate elements have the potential to up-regulate reporter gene expression in a tissue-specific manner and that expression domains often overlap, but are not necessarily identical, between family members. Over two thirds of the families are conserved in duplicate in fish and appear to predate the large-scale duplication events thought to have occurred at the origin of vertebrates. We propose a model whereby gene duplication and the evolution of cis-regulatory elements can be considered in the context of increased morphological diversity and the emergence of the modern vertebrate body plan